Abstract

In financial markets, abnormal trading behaviors pose a serious challenge to market surveillance and risk
management. What is worse, there is a growing number of abnormal trading events in which some experienced
traders form a collusive clique and collaborate to manipulate certain instruments, misleading other investors
by applying similar trading behaviors in order to maximize their personal benefits. In this paper, a method is
proposed to detect the hidden collusive cliques involved in an instrument of futures markets. The method first
calculates the correlation coefficient between any two eligible unified aggregated time series of signed order
volume, and then combines the connected components from multiple sparsified weighted graphs, which are
constructed from the correlation matrices by keeping only the edges whose correlation coefficient is over a
user-specified threshold. Experiments conducted on real order data from the Shanghai Futures Exchange show
that the proposed method can effectively detect suspect collusive cliques. A tool based on the proposed method
has been deployed in the exchange as a pilot application for futures market surveillance and risk management.

Keywords: Futures markets, Financial trading behaviors, Collusive cliques, Correlation coefficient,
Weighted graph, Unevenly-spaced time series.



1. Introduction 

In financial markets, trading behaviors roughly refer to the
operations and actions conducted by individual investors to
buy and sell financial instruments through an exchange
institute. Although normal trading activities dominate,
abnormal market behaviors (for example, price manipulation
and circular trading) happen now and then, especially in
emerging financial markets [1-5]. These abnormal behaviors
not only disturb the operating and pricing mechanisms of the
market, but also threaten the safety of financial markets and
hurt the interests of honest investors. What is worse, there
is a growing number of abnormal trading events in which, to
maximize their personal benefits, some traders form a
collusive clique and collaborate with each other to
manipulate the movement of certain instruments, thereby
misleading other investors. Collusive trading activities
[4-6] are becoming a threatening and concealed type of
financial market manipulation. Discovering the hidden
collusive cliques among numerous market participants and
massive trading data poses a tough challenge to financial
market surveillance and risk management, and has thus
attracted increasing attention from market regulators and
researchers in recent years. This attention is natural given
that the world is still struggling with the financial crisis.

The goal of this study is to detect collusive cliques in
futures markets based on similar trading behaviors of
investors. Empirical observation and analysis of the trading
operations of market participants can provide clues for
detecting collusive cliques in futures trading. The members
of a clique are usually similar to each other in trading
behavior while different from those outside the clique.
Similar trading behavior means that the members buy or sell a
certain instrument at roughly the same time, and that even
their order volumes are correlated. In contrast, the trading
behaviors of ordinary (normal) investors who do not belong to
any collusive clique are unlikely to be correlated.
Admittedly, some "clever" traders may attempt to take
different operations to mask their collusive behaviors, so
that their activities look like those of normal investors and
escape detection. However, successful disguise requires not
only advanced trading skills but also extra cost, which
prevents such collusive behaviors from becoming common. This
paper focuses on detecting the first kind of collusive
behavior, where individual investors show (roughly) similar
trading patterns, and leaves the problem of detecting the
second kind, where individual investors do not necessarily
have similar trading patterns, as future work.

In this paper, we propose an effective method to identify
collusive cliques among numerous market participants. We
first select the dataset of real order records from the
Shanghai Futures Exchange by conducting a comparative
analysis of the major information sources of futures trading
activities. Then, taking signed order volume as the
characteristic variable of futures trading activities, which
can reliably reflect the trading intentions of investors, we
define a unified aggregated time series to alleviate the
disturbance caused by time differences between trading
events, and calculate the correlation coefficient between any
two eligible unified aggregated series. Next, based on the
correlation matrix of one trading day, a weighted graph is
constructed from the edges whose weights are above a
predefined threshold. After that, the separate connected
components in the weighted graphs of multiple trading days
are combined into an integrated weighted graph, where the
weight of each edge is the sum of its occurrences in the
daily weighted graphs, and the edges whose weights are below
a predefined threshold are discarded. Finally, the connected
subgraphs in the integrated graph are taken as suspect
collusive cliques. Our method is mainly inspired by empirical
observation and analysis of real trading data, and we give
first priority to the method's practicality in real
applications of market surveillance and risk management.

This paper is organized as follows. In Section 2 we survey
related work. The real dataset used in our study is
introduced in Section 3. Section 4 gives the details of the
proposed detection method and the concrete algorithms.
Experimental results are presented in Section 5. Finally,
Section 6 concludes the paper and highlights some future
work.

2. Related Work 

To the best of our knowledge, there is no related work that
detects collusive cliques in futures markets based on similar
trading behaviors as studied in this paper. However, some
works have been dedicated to the problem of detecting
abnormal trading activities in financial markets from
different perspectives. For example, price manipulation, one
of the major fraudulent trading activities, has been
investigated by various methods, including a pattern
recognition based approach [1], behavioral statistical models
[2, 4, 7, 8], the rational expectation theory of corners [9]
and domain driven data mining [10].

As an emerging kind of abnormal activities in finan- 
cial markets, collusive activities among investors re- 
cently have been investigated from different aspects for 
explaining market manipulation. For distinguishing the 
irregular trading patterns from the regular trading opera- 
tions, Franke et al [11] developed detection approaches 
based on spectral clustering method. They generated 
a trader network to represent the trading behaviors of 
traders and thus characterized the market. If the ac- 
tual market behaviors deviate from the allowed trad- 
ing behaviors in the market, then irregularities are re- 
ported. However, this study was conducted on an ex- 
perimental stock market. Palshikar et al [5] proposed a 
graph clustering algorithm for detecting a set of collu- 
sive traders who have heavier trading among themselves 
compared to their trading with the other traders. They 
constructed stock flow graph with synthetic trading data 
to represent the trading relationships between traders, 
and applied the graph clustering method to find collu- 
sive traders. Cao et al [6] argued that market manipula- 
tion derives from the activities of a group of hidden ma- 
nipulators who collaborate with each other to manipu- 
late three trading sequences: buy-orders, sell-orders and 
trades, through carefully arranging their prices, volumes 
and time. They proposed a coupled Hidden Markov Model
(HMM)-based approach to describe the interactive
behaviors among group members, and further to
detect abnormal manipulative trading behaviors in
orderbook-level stock data.

Compared with the works above, our study has three distinct
features. First, our work addresses collusive clique
detection in futures markets, while the existing works all
studied irregularity discovery in stock markets. Although
both futures and stocks are financial products, their trading
mechanisms are quite different. Second, we build weighted
graphs to characterize the interactions among investors based
on their trading behaviors, which differs from the existing
works that also used graph based approaches. Last but not
least, our method is inspired by and evaluated with real
order data of futures trading in the Shanghai Futures
Exchange, whereas most existing works (not including Cao et
al [6]) evaluated their methods with synthetic data.

In fact, the detection of collusive behaviors has also been
studied in other fields, including online auction systems
[12, 13], online recommender systems [14-17], online
reputation systems [18-20] and P2P file sharing networks
[21, 22]. The solutions for these systems are effective in
their respective scenarios, but none of them can be applied
to the detection of collusive cliques in futures markets, for
three reasons. First, trading activities in futures markets
are complicated. For example, every investor can continuously
open or close long/short positions in any futures contract.
Second, hundreds of thousands of order records are submitted
to futures trading systems in one typical trading day. Data
sets of such a large scale do not appear in most online
auction systems or reputation systems. Last but not least,
behavioral ratings and interactions between two colluders are
not an ideal description of collusive behaviors for the
high-frequency order sequences in futures trading systems.

In our study, one key technique for detecting collusive
cliques is the measure of similarity between a pair of
unevenly-spaced time series. The similarity of time series
has been measured by various metrics, including Euclidean
distance (ED) [23, 24] and more sophisticated metrics, such
as Dynamic Time Warping (DTW) [25, 26], Edit distance with
Real Penalty (ERP) [27], distance based on Longest Common
Subsequence (LCSS) [28], Edit Distance on Real sequences
(EDR) [29], Spatial Assembling Distance (SpADe) [30] and
Sequence Weighted ALignmEnt model (SWALE) [31]. The
representations and distance measures mentioned above have
been comprehensively evaluated by comparative experiments in
[32] for querying and mining time series databases. These
methods try to identify matching elements between time
series. However, the trading behavior similarity of two
investors in collusive clique detection is characterized by
the conformity and correlation between the pair of
corresponding time series of signed volume, which emphasizes
shape similarity instead of magnitude similarity. The time
stamp of a trading activity is a principal characteristic for
evaluating the similarity between two time series of signed
volume: two time series with the same shape that occur in
different time periods cannot be considered similar. For
these reasons, correlation measurement is more appropriate
for our study.

3. The Dataset

Data is the key to data mining. Understanding the 
data is crucial to the design of data mining algorithms. 
In this section, we will introduce the dataset used in this 
study for detecting collusive cliques in futures markets. 

In futures trading, there are different types of data, such
as order records, trade results and position changes, which
can provide clues for describing the trading behavior of a
market participant. An order is an instruction to buy or sell
instruments, submitted by an investor to the electronic
trading platform of the exchange institute. The order record
indicates the investor's intention to buy or sell a certain
volume of a specific instrument at a given price at that
moment. The eligible orders from buyers and sellers are
matched according to a certain rule via the electronic
trading platform, and trade reports are sequentially
generated for the investors. Both the dealing prices and the
trade volumes of the transactions are derived from the
corresponding orders and depend on the current market
situation, such as the last prices and the order volumes of
counterparties. The trading results lead to position changes
of the involved investors. Therefore, both trading results
and position changes are derivative consequences of order
records, and they can only partly represent the investors'
intentions. Order information, in contrast, can properly
characterize the investors' trading behaviors.

The dataset used in our investigation is entirely 
from the real order series of the Shanghai Futures Ex- 
change, which is the largest one in China's domestic fu- 
tures market and has considerable impact on the global 
derivative market. Currently, the electronic trading
platform of the exchange institute receives only limit
orders submitted by the investors. There are hundreds of
thousands of order records from market investors in one
typical trading day, which comprises the opening call
auction (8:55 - 8:59) and four continuous auction sessions
(9:00-10:15, 10:30-11:30, 13:30-14:10 and 14:20-15:00).
We collect a representative order dataset that covers
three active futures contracts, namely copper, fuel oil
and natural rubber, in the nine trading days from
Sep 16, 2008 to Sep 26, 2008. The dataset contains
1,893,519 order records and involves 66,861 market
participants.

The statistical information of the order records of the
three futures contracts is given in Table 1. A limit order
Table 1. Statistical information of the order dataset of three
futures contracts (copper, fuel oil and natural rubber)
in the nine trading days from Sep 16, 2008 to Sep 26, 2008.

Futures          Number of orders   Number of investors
copper                    441,104                19,414
fuel oil                  650,079                22,537
natural rubber            802,336                24,910

denotes the ask indicator with a negative sign for a sell
order. That is, in a signed order volume sequence, the
volume of a buy order is positive while the volume of a
sell order is negative. By using signed order volume
to describe a trading event of a certain investor at the
moment of submitting her/his order, a discrete event
sequence over a period of trading time can naturally
characterize the trading behavior of an investor.

For a futures investor, we denote by $v(t_i)$ the signed
order volume of an order submitted by her/him at time
$t_i$ ($i = 1, 2, \ldots, N$). $N$ is the length of the
sequence $\{t_i\}$, which may be different for different
investors, and the time points of the sequences for
different investors can also be different. Thus, the time
series $\{v(t_i)\}$ of the signed order volume is an
unevenly-spaced event sequence.

Example 1. Table 2 illustrates the construction of two 
signed order volume sequences from the limit order se- 
quences of two investors #1 and #2. In the table, the 
first five columns are the information fields of limit or- 
ders, and the last column represents the signed order 
volume. The timestamp of a limit order is in a format that
uses the colon as the separation character (for example, 09:00:30).

4.2. Aggregated Time Series 

The event series of signed order volume above is not
appropriate for calculating the behavior similarity of
different investors, for two reasons. On one hand, even
though two investors belonging to a clique desire to apply
the same order strategy, their operations cannot be exactly
synchronous in practice; there usually exists a small lag
for some reason (e.g., network speed or the queuing policy
of the exchange). On the other hand, active speculators such
as day traders always issue a large number of orders, and
their long event sequences make the computation of behavior
similarity more complex. Therefore, we introduce an
aggregated sequence to replace the original signed order
volume sequence to represent the behavior of an investor.

We specify the size $\delta_t$ of a time window. Given a
signed order volume sequence, we split the sequence from its
starting timestamp into a series of consecutive windows (or
segments) of length $\delta_t$, each of which is labeled by
its time index, a non-negative integer starting from 0. That
is, the first window is labeled by 0, the second one by 1,
and so on. For the $i$-th window, its time index is denoted
by $s_i$, and it covers the time range
$[s_i\delta_t, (s_i+1)\delta_t)$. We aggregate the signed
volumes of the different orders happening within each window
into a single value. Concretely, for the $i$-th window,



record includes a virtual ID representing the investor, 
bid/ask indicator, order price and volume. All other
sensitive information is filtered out for privacy
reasons.

4. Methodology 

In this section, we will describe the detail of the 
method employed to detect collusive cliques by calcu- 
lating correlation coefficients between trading series and 
constructing weighted graphs from the correlation coef- 
ficient matrices. The algorithms for implementing the 
method are also given. 

4.1. Selection of the Target Variable 

A limit order refers to an order submitted by an in- 
vestor to buy or sell an instrument at a specific price 
(rather than a market price), thus it contains the fun- 
damental information such as bid/ask indicator, order 
price and order size. Among these fields of a limit order
record, which one can be used as a representative data
item to describe the trading intention of an investor? To
answer this question, let us first examine these fields in
detail.

As a piece of crucial information, the bid/ask indicator
indicates whether the order is a buy limit order or a sell
limit order, that is, whether the investor wants to own or
to abandon the asset. The order price is the specific price
at which the investor hopes the order will be filled.
Generally, an order whose price is close to the latest trade
price of the market will be filled immediately, and the
prices of orders submitted during a short period are almost
the same. Consequently, the price, which depends on the
market situation, does not distinguish the investors'
intentions. The order volume reflects the amount of the
asset that an investor intends to buy or sell.

Based on the preceding analysis of different fields in 
limit order records, we decide to combine the order vol- 
ume and the bid/ask indicator into a signed order vol- 
ume as the proper representation of a participant's trad- 
ing intention. A signed order volume sequence denotes 
the bid indicator with positive sign for buy order, and 
Table 2. The limit order sequences of two investors and the corresponding signed order volume sequences. The first five columns
are the information fields of limit orders. The last column represents the signed order volume, whose positive sign means a buy
order and negative sign means a sell order.

Investor  Timestamp  Indicator  Price  Volume  Signed volume
1         09:00:30   Buy         3211       2              2
1         09:03:06   Sell        3216       2             -2
1         09:03:12   Sell        3214       1             -1
1         09:08:02   Sell        3206       2             -2
1         09:08:26   Buy         3204       6              6
1         09:10:28   Sell        3205       3             -3
2         09:00:40   Buy         3211       3              3
2         09:03:04   Sell        3216       4             -4
2         09:03:10   Buy         3214       2              2
2         09:08:05   Sell        3206       3             -3
2         09:08:30   Buy         3204      10             10
2         09:12:02   Buy         3201       2              2



the aggregation value $V(s_i)$ is the sum of all signed
volumes of orders happening in
$[s_i\delta_t, (s_i+1)\delta_t)$. Formally,

$$V(s_i) = \sum_{s_i\delta_t \le t_j < (s_i+1)\delta_t} v(t_j). \qquad (1)$$

Above, $v(t_j)$ is the signed volume of an order happening
at the timestamp $t_j$ in the signed order volume sequence
under study. Thus, a signed order volume sequence can be
transformed into an aggregated sequence, denoted by
$\{(s_i, V(s_i))\}$ ($s_i = 0, 1, 2, \ldots$). Furthermore,
we discard any aggregated point $s_i$ whose aggregated
value $V(s_i) = 0$, and then get the final aggregated signed
order volume sequence, which is an aggregated time series.

Example 2. For the data in Table 2, let $\delta_t = 60$
seconds and let the starting timestamp of the order series
be 09:00:00. The time index sequences of the aggregated
sequences for the two investors #1 and #2 are then
{0, 3, 8, 10} and {0, 3, 8, 12}, respectively, and the two
signed order volume sequences are aggregated into the
following two aggregated signed order volume sequences:
{(0, 2), (3, -3), (8, 4), (10, -3)} and {(0, 3), (3, -2),
(8, 7), (12, 2)}.
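
The aggregation step can be illustrated with a minimal Python sketch. It is not the implementation deployed in the exchange: the function names `to_seconds` and `aggregate`, the tuple layout of an order and the default arguments are assumptions made only for this illustration. The order data are those of Table 2.

```python
from collections import defaultdict

def to_seconds(hhmmss):
    """Convert a colon-separated timestamp such as '09:03:06' to seconds."""
    h, m, s = (int(part) for part in hhmmss.split(":"))
    return h * 3600 + m * 60 + s

def aggregate(orders, start="09:00:00", window=60):
    """Aggregate signed order volumes per time window (Equation 1).

    orders: iterable of (timestamp, indicator, volume) tuples (hypothetical layout).
    Returns a sorted list of (time index, aggregated value); windows whose
    aggregated value is zero are discarded, as described in the text.
    """
    t0 = to_seconds(start)
    sums = defaultdict(int)
    for timestamp, indicator, volume in orders:
        signed = volume if indicator == "Buy" else -volume  # buy > 0, sell < 0
        index = (to_seconds(timestamp) - t0) // window
        sums[index] += signed
    return [(i, v) for i, v in sorted(sums.items()) if v != 0]

# Limit orders of investors #1 and #2 from Table 2.
investor1 = [("09:00:30", "Buy", 2), ("09:03:06", "Sell", 2), ("09:03:12", "Sell", 1),
             ("09:08:02", "Sell", 2), ("09:08:26", "Buy", 6), ("09:10:28", "Sell", 3)]
investor2 = [("09:00:40", "Buy", 3), ("09:03:04", "Sell", 4), ("09:03:10", "Buy", 2),
             ("09:08:05", "Sell", 3), ("09:08:30", "Buy", 10), ("09:12:02", "Buy", 2)]

agg1 = aggregate(investor1)
agg2 = aggregate(investor2)
print(agg1)  # [(0, 2), (3, -3), (8, 4), (10, -3)]
print(agg2)  # [(0, 3), (3, -2), (8, 7), (12, 2)]
```

With a 60-second window and a starting timestamp of 09:00:00, the sketch reproduces the two aggregated sequences of Example 2.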

In practice, there will be no aggregated data in a certain
time span if no order event occurs during that period, thus
the aggregated time series is unevenly spaced. The time
window size $\delta_t$ determines the granularity of
aggregation and the length of the aggregated series. When
the window size is enlarged, buying and selling volumes
within a window may offset each other, which makes the
aggregated value of that window smaller and consequently
degrades the calculation result. Therefore, a reasonable
time window size is critical to the calculation of the
behavior correlation coefficient.

Furthermore, collusive investors tend to place orders
frequently to influence the market, so they easily become
active traders in the market. Consequently, investors with
few orders will very likely be excluded from the detected
potential collusive cliques, because they will not be highly
correlated with the investors who have more orders. To
reduce unnecessary computation and thus boost efficiency, we
filter out the investors who have few orders before
computing the correlation coefficients. Concretely, we
compare the length of each aggregated time series with an
empirical threshold ($\delta_L$), and only the series with a
length no shorter than the threshold are kept for further
processing. We call these aggregated time series eligible
aggregated signed order volume series. Only the eligible
aggregated signed order volume series are used for
correlation coefficient computation and potential collusive
clique detection.

4.3. Unified Aggregated Time Series and Correlation 
Measure 

The trading behavior similarity between two investors is
evaluated by the strength of association between the
corresponding aggregated time series, which is commonly
measured by the correlation coefficient [33, 34]. In
statistics, the correlation coefficient is used as an
indicator of the degree to which an event or phenomenon is
associated with, related to, or can be predicted from
another, as well as a measure of the strength of the linear
relationship between two variables. It has been widely
applied in the financial area. We adopt the correlation
coefficient as the similarity measure of two investors'
trading behaviors.

Let us consider two investors A and B, whose aggregated time
series are denoted by $V_A$ and $V_B$, respectively. Both
aggregated time series are unevenly spaced and discrete, and
their time indices $s_i^A$ and $s_i^B$ with the same
subscript $i$ are not necessarily the same, thus the methods
for evenly-spaced time series are not applicable here. We
merge their time index sets $s^A$ and $s^B$ into a unified
time index set $s$, i.e., $s = s^A \cup s^B$. Based on the
unified time index set, we define the unified aggregated
time series of signed order volume of investor A as follows:



$$U_A(s_k) = \begin{cases} V_A(s_k), & s_k \in s^A \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$



A similar definition is applied to the aggregated time
series $V_B$ of investor B, which gives the unified
aggregated time series $U_B$ of investor B.

Example 3. Following Example 2, we go further to compute
the unified aggregated time series of investors #1 and #2.
The unified time index set is {0, 3, 8, 10} ∪ {0, 3, 8, 12}
= {0, 3, 8, 10, 12}. According to Equation 2, we have the
unified aggregated time series $U_1$ and $U_2$ as follows:
$U_1$ = {(0, 2), (3, -3), (8, 4), (10, -3), (12, 0)},
$U_2$ = {(0, 3), (3, -2), (8, 7), (10, 0), (12, 2)}.

The correlation coefficient is evaluated between unified
aggregated time series. For $U_A$ and $U_B$, their
correlation coefficient $r_{AB}$ is defined as follows:



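$$r_{AB} = \frac{\langle U_A U_B \rangle - \langle U_A \rangle \langle U_B \rangle}{\sqrt{\left(\langle U_A^2 \rangle - \langle U_A \rangle^2\right)\left(\langle U_B^2 \rangle - \langle U_B \rangle^2\right)}} \qquad (3)$$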



where the angular brackets $\langle \cdots \rangle$ represent
the average over all the aggregated events (or points) in
the series. The correlation coefficient $r$ is between -1
and 1. A positive $r$ value indicates the existence of
positive correlation, while a negative $r$ value implies
negative correlation. A zero $r$ means that there is no
linear correlation between the two time series. For
collusive clique detection, negative correlation is of
little significance because it means that the trading
behaviors of the two investors are almost opposite. In fact,
only positive correlation is of significance for collusive
clique detection.

Example 4. Following Example 3, according to Equation 3,
the correlation coefficient between the two unified
aggregated time series $U_1$ and $U_2$ is 0.956, which means
that the trading behaviors of the two investors are strongly
positively correlated.
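
As a hedged illustration of Equations 2 and 3, the sketch below unifies two aggregated series on the union of their time indices and evaluates the correlation coefficient from the averages in Equation 3. The helper names `unify`, `mean` and `correlation` are hypothetical, and the sketch assumes that neither unified series is constant (otherwise the denominator is zero).

```python
import math

def unify(series_a, series_b):
    """Place two aggregated series on the union of their time indices (Equation 2).

    Each series is a list of (time index, aggregated value) pairs; a missing
    index is filled with 0.
    """
    a, b = dict(series_a), dict(series_b)
    index = sorted(set(a) | set(b))
    return index, [a.get(k, 0) for k in index], [b.get(k, 0) for k in index]

def mean(values):
    return sum(values) / len(values)

def correlation(u_a, u_b):
    """Correlation coefficient of two unified series (Equation 3).

    Assumes both series have non-zero variance.
    """
    covariance = mean([x * y for x, y in zip(u_a, u_b)]) - mean(u_a) * mean(u_b)
    var_a = mean([x * x for x in u_a]) - mean(u_a) ** 2
    var_b = mean([y * y for y in u_b]) - mean(u_b) ** 2
    return covariance / math.sqrt(var_a * var_b)

# Aggregated series of investors #1 and #2 from Example 2.
agg1 = [(0, 2), (3, -3), (8, 4), (10, -3)]
agg2 = [(0, 3), (3, -2), (8, 7), (12, 2)]

index, u1, u2 = unify(agg1, agg2)
r = correlation(u1, u2)
print(index)        # [0, 3, 8, 10, 12]
print(u1, u2)       # [2, -3, 4, -3, 0] [3, -2, 7, 0, 2]
print(round(r, 4))
```

For the data of Example 3 the sketch evaluates to roughly 0.957, which agrees with the value reported in Example 4 up to the treatment of the last digit.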

When considering $N$ investors, the correlation coefficients
between any two investors $i$, $j$ build a correlation
matrix $R$, which is an $N \times N$ matrix where the entry
$r_{ij}$ indicates the correlation coefficient between the
two unified aggregated time series $U_i$ and $U_j$. The
correlation matrix is symmetric because the correlation
between $U_i$ and $U_j$ is the same as the correlation
between $U_j$ and $U_i$. The diagonal elements of the matrix
are the self-correlation coefficients of the unified
aggregated time series, and their values are 1.

4.4. Discovery of collusive cliques 

With the correlation coefficient matrix, we construct a
weighted graph in which a node represents an investor in the
market, and an edge is added to connect two nodes if the
correlation coefficient between the two nodes' corresponding
unified aggregated time series is larger than a
user-specified threshold ($\delta_w$). Note that the
constructed weighted graph does not contain loop edges,
there is no more than one edge between any two nodes, and
the weight of each edge is the correlation coefficient. The
resulting graph is not necessarily a connected graph; very
possibly it includes some isolated nodes and some connected
components (subgraphs). An isolated node has no link to any
other node. A connected component may be a complete graph,
which means that all nodes in the component are highly
similar to each other but only weakly similar to the nodes
outside the component. The connected components therefore
conform to the criterion of potential collusive cliques. The
value of the correlation coefficient threshold $\delta_w$
influences the number of resulting connected components. As
the threshold increases, the number of resulting connected
components decreases, and the detected result is more
reliable but some suspect traders may be missed. On the
contrary, when the threshold decreases, the number of false
collusive cliques (noise) rises, which degrades the
detection precision. Therefore, a proper threshold
$\delta_w$ is of substantial importance to guarantee the
detection performance.
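
The construction of one daily graph can be sketched as follows. This is only an illustration under stated assumptions: the correlation values are made up, the names `threshold_graph` and `connected_components` are hypothetical, and the components are found with a plain depth-first search rather than any particular graph library.

```python
def threshold_graph(corr, delta_w):
    """Keep one undirected edge per investor pair whose correlation exceeds delta_w.

    corr: dict mapping (investor, investor) pairs to correlation coefficients.
    Returns a dict mapping sorted pairs to their weights; self-loops are skipped.
    """
    edges = {}
    for (i, j), r in corr.items():
        if i != j and r > delta_w:
            edges[tuple(sorted((i, j)))] = r
    return edges

def connected_components(edges):
    """Return the connected components (as sorted node lists) of the edge set.

    Isolated nodes never appear in an edge, so they are dropped automatically.
    """
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first search from an unvisited node
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacency[node] - component)
        seen |= component
        components.append(sorted(component))
    return sorted(components)

# Made-up correlation coefficients for six hypothetical investors A..F.
corr = {("A", "B"): 0.9, ("B", "C"): 0.8, ("A", "C"): 0.2,
        ("D", "E"): 0.6, ("A", "D"): 0.1, ("E", "F"): -0.7}

edges = threshold_graph(corr, delta_w=0.5)
comps = connected_components(edges)
print(comps)  # [['A', 'B', 'C'], ['D', 'E']]
```

In this toy input, investor F has no edge above the threshold and is therefore an isolated node, while A, B and C form one component even though A and C are only weakly correlated directly.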

It is always the case that we do not know how many collusive
cliques exist in the market and who belongs to which clique.
Fortunately, a practical observation can help us make the
decision: a collusive clique will conduct cooperative and
abnormal actions repeatedly. So if a doubtful clique appears
in multiple trading days, it is reasonable to believe that
it is a suspect collusive clique. Thus we consider multiple
consecutive trading days and construct one weighted graph
for each trading day, then combine the connected components
in the daily graphs into an integrated weighted graph in
which the weight of each edge is the sum of its occurrences
in the daily weighted graphs. Finally, after eliminating the
isolated nodes and the edges whose weights are below a
predefined threshold ($\delta_f$), the connected subgraphs
in the integrated graph are output as suspect collusive
cliques.
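
The multi-day integration can be sketched in the same spirit. The sketch assumes that each daily graph is given as the list of edges that survived the correlation threshold on that day (every such edge lies in some connected component of its daily graph), so that the weight of an edge in the integrated graph is simply the number of days on which it occurs. The names `integrate` and `suspect_cliques` and the toy edge lists are hypothetical.

```python
from collections import Counter

def integrate(daily_edges, delta_f):
    """Count on how many days each undirected edge occurs and keep the
    edges whose count is at least delta_f."""
    counts = Counter()
    for edges in daily_edges:
        # Normalise the orientation so that (B, A) and (A, B) are the same edge.
        counts.update({tuple(sorted(edge)) for edge in edges})
    return {edge: n for edge, n in counts.items() if n >= delta_f}

def suspect_cliques(edges):
    """Connected subgraphs of the integrated graph, as sorted node lists."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, cliques = set(), []
    for start in adjacency:
        if start in seen:
            continue
        clique, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in clique:
                clique.add(node)
                stack.extend(adjacency[node] - clique)
        seen |= clique
        cliques.append(sorted(clique))
    return sorted(cliques)

# Toy daily edge lists for three trading days (made-up investors A..E).
day1 = [("A", "B"), ("B", "C"), ("D", "E")]
day2 = [("A", "B"), ("B", "C")]
day3 = [("B", "A"), ("C", "D")]

kept = integrate([day1, day2, day3], delta_f=2)
cliques = suspect_cliques(kept)
print(kept)     # {('A', 'B'): 3, ('B', 'C'): 2}
print(cliques)  # [['A', 'B', 'C']]
```

Edges that appear on a single day only, such as D-E and C-D in this toy input, fall below the threshold and disappear, which leaves D and E as isolated nodes that are not reported.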

4.5. The algorithms 

Our method of collusive clique detection by similar
trading behavior analysis mainly consists of two stages:

• Computing the unified aggregated time series of
signed order volume for each investor, and
calculating the correlation coefficient matrix based on
all eligible unified aggregated time series, and

• Identifying suspect collusive cliques by combining
the connected components in the weighted graphs
of multiple consecutive trading days into an
integrated weighted graph.

We develop two algorithms to implement the tasks of 
the two stages above. They are outlined in Algorithm 1 
and Algorithm 2. 

Algorithm 1 aggregates the signed order volumes of a market
investor in every time window of a trading day into a single
value, filters out the short aggregated time series, and
then calculates the correlation coefficient between any two
unified aggregated time series. The input of this algorithm
is the preprocessed order record set of one futures contract
in a single trading day, where each order record includes
the investor's virtual ID, the signed order volume and a
second-based timestamp converted from the time format that
uses the colon as the separation character.

With the correlation coefficient matrices, Algorithm 2 is
developed to detect collusive cliques. It first constructs
one weighted graph for each trading day, and then merges the
connected components of the daily graphs into an integrated
weighted graph. For each connected component in the
integrated weighted graph, if the weights of all its edges
are no less than the threshold $\delta_f$, then the
connected component is output as a suspect collusive clique.

5. Experiments and Discussions 

In this section, we will present the experimental re- 
sults with real order data of three futures contracts from 



Algorithm 1: Calculating correlation coefficient matrix

Input: Order record set D of one futures contract in
one trading day, time window size $\delta_t$, length
threshold $\delta_L$ of aggregated time series
Output: Correlation coefficient matrix R
T := ∅;
for each investor p do
    Extract the time series $v_p$ of signed order volume
    from D;
    Aggregate $v_p$ by summing up the signed order
    volumes in each time window s of size $\delta_t$. The
    aggregated time series is denoted as $V_p$;
    if $|V_p| \ge \delta_L$ then
        Add $V_p$ to T;
    end
end
for each $V_i, V_j \in T$, $i \ne j$ do
    Merge the two time index sequences $s^i$ and $s^j$
    into a unified time index sequence s with
    $s = \mathrm{sort}(s^i \cup s^j)$;
    Unify $V_i$ based on s into $U_i$ by
    $\{U_i(s_k)\} = \{V_i(s_k) \mid s_k \in s^i\} \cup \{0 \mid s_k \in s, s_k \notin s^i\}$;
    Unify $V_j$ into $U_j$ in the same way;
    Calculate the correlation coefficient $r_{ij}$ between
    $U_i$ and $U_j$ according to the following formula:
    $r_{ij} = \frac{\langle U_i U_j \rangle - \langle U_i \rangle \langle U_j \rangle}{\sqrt{\left(\langle U_i^2 \rangle - \langle U_i \rangle^2\right)\left(\langle U_j^2 \rangle - \langle U_j \rangle^2\right)}}$;
end
Output R;



the Shanghai Futures Exchange. The experimental re- 
sults confirm the eflFectiveness of the proposed method 
in detecting collusive cUques. 
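The second loop of Algorithm 1 (unifying two unevenly-spaced aggregated series and computing their correlation coefficient) can be sketched as follows. The sparse representation `{window_index: value}` and the function names are assumptions of this example, and returning 0.0 for a constant series is a convention chosen here, not something stated in the paper.

```python
from math import sqrt


def unify(series_i, series_j):
    """Align two sparse series on the sorted union of their time indices.

    Missing points are filled with 0, i.e. no net order volume in that window.
    """
    index = sorted(set(series_i) | set(series_j))
    u_i = [series_i.get(k, 0) for k in index]
    u_j = [series_j.get(k, 0) for k in index]
    return u_i, u_j


def correlation(u_i, u_j):
    """Correlation coefficient written with averages, as in Algorithm 1."""
    n = len(u_i)
    mean_i = sum(u_i) / n
    mean_j = sum(u_j) / n
    mean_ij = sum(a * b for a, b in zip(u_i, u_j)) / n
    var_i = sum(a * a for a in u_i) / n - mean_i ** 2
    var_j = sum(b * b for b in u_j) / n - mean_j ** 2
    if var_i <= 0 or var_j <= 0:
        return 0.0  # constant series: correlation undefined, treated as 0 here
    return (mean_ij - mean_i * mean_j) / sqrt(var_i * var_j)
```

Two investors who trade in the same windows with proportional volumes obtain a coefficient close to 1, while investors who never trade in the same window share no non-zero products in the numerator.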

5.1. The Effect of Time Window Size $\delta_t$

Algorithm 2: Detecting collusive cliques

Input: Correlation coefficient matrix set $\{R^d\}$ for one futures contract
       in multiple trading days, threshold $\delta_w$ for constructing
       weighted graphs, edge weight threshold $\delta_f$ for the integrated
       weighted graph
Output: Candidate collusive cliques

for each correlation matrix $R^d$ do
    Construct a simple weighted graph using $R^d$ as the adjacency matrix, in
    which an edge exists if its weight is greater than $\delta_w$;
    Obtain the connected components set $S^d$ of the graph;
end
Merge $\{S^d\}$ into an integrated weighted graph $G$, in which the weight of
each edge is the sum of its occurrences in $\{S^d\}$;
Eliminate the edges whose weights are below $\delta_f$. The connected
subgraphs in $G$ are output as potential collusive cliques;

In aggregating time series of signed order volume, the length of the time
window $\delta_t$ is an important parameter that directly influences the
correlation coefficient calculation. To examine the impact of the window size
$\delta_t$ on the correlation coefficient, we choose two time series of signed
order volume from the trading data of fuel oil futures on September 25, 2008,
which are shown in Fig. 1(a). We aggregate the two time series with different
time window sizes. Fig. 1(b) shows the aggregated time series with the time
window size $\delta_t$ = 60 seconds. Then we calculate the correlation
coefficient between the two resulting aggregated time series with the time
window size increasing from 1 to 200 seconds. The results are shown in Fig. 2.
We can see that the correlation coefficient increases as the time window
grows, and it approaches a stable value asymptotically at about 60 seconds.
The result is analogous to the Epps effect [35-38], in which the stock return
correlation decreases as the sampling frequency of data increases. From
Fig. 2, we argue that a time window of 60 seconds is a reasonable choice in
our experiments.
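The window-size experiment can be reproduced in outline with the sketch below. It is an illustration under assumptions: orders are given as hypothetical `(timestamp_seconds, signed_volume)` pairs, and the data in the usage note are synthetic, not the fuel oil records used in the paper.

```python
from math import sqrt


def aggregated_correlation(orders_a, orders_b, window):
    """Correlation of two order streams after aggregation with `window` seconds.

    orders_a, orders_b: lists of (timestamp_seconds, signed_volume).
    """
    def aggregate(orders):
        agg = {}
        for timestamp, volume in orders:
            key = timestamp // window
            agg[key] = agg.get(key, 0) + volume
        return agg

    a, b = aggregate(orders_a), aggregate(orders_b)
    index = sorted(set(a) | set(b))  # unified time index sequence
    x = [a.get(k, 0) for k in index]
    y = [b.get(k, 0) for k in index]
    n = len(index)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum(p * q for p, q in zip(x, y)) / n - mean_x * mean_y
    var_x = sum(p * p for p in x) / n - mean_x ** 2
    var_y = sum(q * q for q in y) / n - mean_y ** 2
    if var_x <= 0 or var_y <= 0:
        return 0.0  # constant series, treated as uncorrelated in this sketch
    return cov / sqrt(var_x * var_y)


def sweep(orders_a, orders_b, windows):
    """Correlation coefficient for each candidate window size."""
    return {w: aggregated_correlation(orders_a, orders_b, w) for w in windows}
```

As a synthetic usage example, let one investor place orders at 0, 100, 200 and 300 seconds and a second investor place the same signed volumes 5 seconds later. With a 1-second window the two series never share a window and the coefficient is 0; with a 60-second window the orders fall into the same windows and the coefficient is 1, which mirrors the qualitative trend described above.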

5.2. Determining the Length Threshold ($\delta_l$) of Aggregated Time Series

For the whole data set, the cumulative distribution function
$F(L) = P(L' < L)$ of the length $L$ of aggregated time series is shown in
Fig. 3. As the figure shows, about 90% of the time series are less than 15 in
length and are excluded from the correlation coefficient calculation. Only
about 10% of the investors are included in collusive clique detection, which
reduces the cost of the correlation calculation. Therefore, we choose 15 as
the empirical value of the threshold $\delta_l$ for filtering the short
aggregated time series, which means that an investor should have placed orders
in at least 15 time windows in a trading day to be included in the collusive
clique detection procedure. This choice conforms to the long-term practical
surveillance experience of the exchange institute.
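The effect of the length threshold on the workload can be illustrated with a short sketch. The function names and the round investor counts in the note below are assumptions chosen for illustration; only the 90%/10% split comes from the text.

```python
def empirical_cdf(lengths, threshold):
    """Empirical F(threshold) = P(L' < threshold) over the observed lengths."""
    return sum(1 for n in lengths if n < threshold) / len(lengths)


def pair_count(n):
    """Number of unordered investor pairs whose correlation must be computed."""
    return n * (n - 1) // 2
```

Since the number of pairwise correlations grows quadratically, keeping about 10% of the series reduces the number of pairs to roughly 1%: for a hypothetical 1000 investors there are 499500 pairs, whereas 100 eligible investors leave only 4950.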

5.3. The Effect of the Correlation Coefficient Threshold $\delta_w$

Fig. 1. (a) The time series of signed order volume of two investors
(horizontal axis: time in seconds) and (b) the corresponding aggregated time
series with $\delta_t$ = 60 seconds (horizontal axis: time index). The
aggregated time series, with fewer data points, retain the profile of the
original time series.

Fig. 2. The impact of the time window size $\delta_t$ on the correlation
coefficient when aggregating two time series of signed order volume. Each
circle (in blue) indicates a correlation coefficient value at a certain window
size. The average line (in red) shows the trend more clearly after smoothing
the data fluctuation. About 60 seconds are needed for the correlation
coefficient to reach its asymptotically stable value, which means that it is
reasonable to choose 60 seconds as the size of the time window in our
experiments.

Now, we consider the order record data of the copper futures contract in one
typical trading day (September 18, 2008) to demonstrate the process of
collusive clique detection. After aggregating and filtering the time series of
signed order volume, we obtain 819 eligible aggregated time series for
computing the correlation coefficient matrix $M_c$.

We construct four weighted graphs based on $M_c$ with different correlation
coefficient threshold values. In Fig. 4, the numbers of connected components
are 10, 8, 6 and 4, corresponding to the threshold values 0.80, 0.85, 0.90 and
0.95, respectively. The number of resulting connected components gradually
decreases as the threshold value grows. We notice that the connected component
with six nodes is (almost) a complete graph in all the sub-figures. The reason
is that the similarity between any pair of nodes in the component is very
large. In practical applications, the supervisors of the exchange institute
can choose different threshold values according to real surveillance
requirements to observe the suspect investors at different monitoring levels.
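The threshold sweep can be sketched as follows, using a depth-first search over the sparsified graph. The dictionary layout `{(i, j): r_ij}` and the function name are assumptions of this example. Isolated investors are not counted as components here, which is our reading of the reported counts (819 series but only 10 components at threshold 0.80).

```python
def count_components(corr, threshold):
    """Count connected components with at least one edge above `threshold`."""
    adjacency = {}
    for (a, b), r in corr.items():
        if a != b and r > threshold:
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)
    seen = set()
    count = 0
    for start in adjacency:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:  # iterative depth-first search
            node = stack.pop()
            for neighbour in adjacency[node] - seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return count
```

Calling the function for each of the thresholds 0.80, 0.85, 0.90 and 0.95 on the same matrix gives the kind of decreasing sequence of component counts discussed above.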

5.4. The Performance of the Proposed Method 

Fig. 3. The cumulative distribution $F(L)$ of the length $L$ of aggregated
time series over the whole data set. About 90% of the time series are less
than 15 (time windows) in length and are excluded from the correlation
coefficient calculation.

According to the experimental results above, we choose the following parameter
values for the detection method: $\delta_t$ = 60 seconds, $\delta_l$ = 15,
$\delta_w$ = 0.90 and $\delta_f$ = 2. We construct the daily weighted graphs
of the three futures contracts (see Table 1) in nine consecutive trading days,
and merge the connected components occurring at least twice in the daily
graphs into an integrated weighted graph. We illustrate the integrated
weighted graphs for the copper, fuel oil and natural rubber futures contracts
in Fig. 5(a), Fig. 5(b) and Fig. 5(c), respectively. There are eighteen
connected subgraphs in these figures. We can see that all the subgraphs are
complete graphs except two, {22069, 12633, 1680, 33473, 3956} in Fig. 5(b) and
{24139, 21244, 29020} in Fig. 5(c). Most subgraphs appear just twice in the
nine trading days, while the four subgraphs including {24686, 28000} in
Fig. 5(a), {12509, 21255, 11668} in Fig. 5(b), {1680, 3203, 4324, 10032,
12633, 17891, 22069} and the largest component in Fig. 5(c) occur at least
three times. This means that these subgraphs can be considered as suspect
collusive cliques. The four subgraphs occurring at least three times can be
more confidently regarded as collusive cliques.
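Whether a detected subgraph is a complete graph can be checked by comparing its edge count with $n(n-1)/2$. The sketch below is a small illustration; the function name and the edge-list representation are assumptions, and the example inputs in the note are hypothetical.

```python
def is_complete(nodes, edges):
    """Return True if every pair of distinct nodes is joined by an edge."""
    nodes = set(nodes)
    present = {
        frozenset(edge)
        for edge in edges
        if len(set(edge)) == 2 and set(edge) <= nodes
    }
    return len(present) == len(nodes) * (len(nodes) - 1) // 2
```

A three-node subgraph with all three edges is complete, whereas a three-node chain with two edges is connected but not complete, as is the case for the two exceptional subgraphs mentioned above.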

Furthermore, by carefully checking the figures, we notice that the set of
investors {1680, 12633, 22069, 4324, 3203, 7891} forms a connected subgraph in
Fig. 5(a) and Fig. 5(c), and part of it, {1680, 12633, 22069}, appears in
Fig. 5(b). In addition, the two sets of investors {3956, 33473} and {4162,
4937, 4987} appear in both Fig. 5(b) and Fig. 5(c), and the two sets of
investors {3956, 33473} and {1680, 12633, 22069} are correlated in the fuel
oil futures, since they join into a single subgraph in Fig. 5(b). We therefore
consider that these investor sets form collusive cliques with high
probability, which will be further confirmed by related background data.
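The cross-contract comparison described above amounts to intersecting member sets. The sketch below uses only investor sets quoted in the text; the two lists are partial transcriptions for illustration, not the full contents of Fig. 5, and the function name and the `min_shared` parameter are assumptions.

```python
def overlaps(cliques_a, cliques_b, min_shared=2):
    """Return the member sets shared by cliques of two futures contracts."""
    shared_sets = []
    for clique_a in cliques_a:
        for clique_b in cliques_b:
            shared = set(clique_a) & set(clique_b)
            if len(shared) >= min_shared:
                shared_sets.append(shared)
    return shared_sets


# Partial lists built from sets named in the text (assumed standalone in (c)).
fuel_oil = [
    {22069, 12633, 1680, 33473, 3956},
    {12509, 21255, 11668},
    {4162, 4937, 4987},
]
natural_rubber = [
    {3956, 33473},
    {4162, 4937, 4987},
    {24139, 21244, 29020},
]
recurring = overlaps(fuel_oil, natural_rubber)
```

The five-member fuel oil subgraph is exactly the union of {3956, 33473} and {1680, 12633, 22069}, which is what the observation about the two sets joining into a single subgraph expresses.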

The experimental results for all three futures contracts are summarized in
Table 3. The average number $N_a$ of eligible aggregated time series over all
the trading days is much smaller than the number of corresponding investors in
Table 1. This indicates that a large number of short aggregated time series
are excluded by the filter threshold $\delta_l$ and only the active investors
are kept for further processing. In Table 3, many connected components occur
only once in the nine trading days. Although our method does not classify them
as suspect collusive cliques, the exchange institute still needs to pay
attention to them in the following trading days. Certainly, the detected
suspect collusive cliques should be further probed and confirmed via the
regulatory procedure of the supervision system in the exchange institute. In
practice, these suspect cliques, even if not confirmed, will be added to the
"black" list of the surveillance system.

Fig. 4. The weighted graphs obtained by using different values of the
threshold $\delta_w$ for the copper futures contract in a trading day. The
number of resulting connected components decreases as the threshold value
increases. The threshold value can be adjusted for different monitoring levels
according to practical surveillance requirements.

Up to now, we have found some suspect collusive cliques. Are they really
collusive cliques? Can we explain why they are treated as collusive cliques?
For this purpose, we first examine the detected suspect collusive cliques
carefully against the surveillance archival data of the exchange. The archival
data covers the background information of all investors and companies involved
in the futures market. The findings are interesting and promising. For most
suspect collusive cliques, the members are interrelated in one way or another.
They come from the same community of a city, belong to the same company, or
even come from the same family. We also find that the accounts of some cliques
are controlled and operated by a backstage manipulator. Such interrelations
make it highly possible for the members of a clique to act in concert when
trading. Now we come to the final step of this study: validating the detected
suspect collusive cliques against the verified collusive cliques of the
surveillance system and the judgement of experienced domain experts from the
exchange institute. Seventeen suspect collusive cliques are verified as
collusive cliques. The numbers of verified collusive cliques (the last column
in Table 3) are 4, 4 and 9 for the copper, fuel oil and natural rubber futures
contracts, respectively. Furthermore, we tracked and analyzed the order
records of the members of these cliques, and the verification results were
reconfirmed. The only detected suspect collusive clique that is not verified
is from the fuel oil futures contract, because we could not find enough
evidence. For privacy reasons, we cannot provide any more details of these
detected cliques.

Table 3. The experimental results of three futures contracts (copper, fuel oil
and natural rubber) in nine consecutive trading days. $N_a$: the average
number of eligible aggregated time series in the nine trading days; $N_c$: the
number of the connected components in all weighted graphs; third column: the
number of the connected subgraphs in the integrated weighted graph, i.e., the
number of detected suspect collusive cliques; fourth column: the number of the
collusive cliques verified by surveillance archival data of the exchange
institute.

Futures          $N_a$   $N_c$   Suspect cliques   Verified cliques
copper             480      14                 4                  4
fuel oil           955      14                 5                  4
natural rubber    1123      20                 9                  9
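As a quick arithmetic check of Table 3, the share of detected cliques that were verified can be computed directly from the last two columns. The ratio is not a metric reported in the paper; it is derived here only from the tabulated counts, and the variable and function names are illustrative.

```python
# (detected suspect cliques, verified cliques) per contract, from Table 3.
table3 = {
    "copper": (4, 4),
    "fuel oil": (5, 4),
    "natural rubber": (9, 9),
}


def verified_share(detected, verified):
    """Fraction of detected suspect cliques that were verified."""
    return verified / detected


total_detected = sum(detected for detected, _ in table3.values())
total_verified = sum(verified for _, verified in table3.values())
overall_share = verified_share(total_detected, total_verified)
```

The totals are 18 detected and 17 verified cliques, consistent with the eighteen connected subgraphs and the seventeen verified cliques stated in the text, which gives an overall share of 17/18, or about 94%.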



can also be applied to investigating other behavior similarities of investors,
for example, position changes per trading day. Experimental results validate
the effectiveness of the proposed method. As a pilot application, a tool based
on the proposed method has been deployed in the Shanghai Futures Exchange to
assist futures market surveillance and risk management.

As for future work, we plan to further optimize the method by utilizing the
data of two neighboring time windows to balance the uneven data distribution.
We also plan to take into account more trading information, such as canceled
orders and trade reports, to enrich the information available for detecting
collusive cliques. Furthermore, we will explore effective approaches to
detecting collusive cliques that show "different" trading behaviors.

Fig. 5. The integrated weighted graphs of the copper (a), fuel oil (b) and
natural rubber (c) futures, obtained by combining connected components of
daily weighted graphs in nine consecutive trading days. The weight of each
edge is the sum of its occurrences in the daily weighted graphs. Only those
edges with weight no less than 2 are retained. Eventually, four (for copper),
five (for fuel oil) and nine (for natural rubber) connected subgraphs are
obtained, which are output as suspect collusive cliques by our method. For the
largest connected component at the bottom-right part of figure (c), the edge
weights are not shown due to the space limit of the figure. However, we have
computed the average value of its edge weights, which is 3.28.